Most of the following analyses are inspired by the pipeline used in Rossi et al. (Cell Stem Cell, 2020). The code associated with this publication is available on GitHub in the Cardiac_Gastruloids repository.
The goal is to analyze the data produced in our lab and to compare them with the data retrieved from the Rossi et al. article. In the rest of the document, the data used in the Rossi et al. article will be named “reference data”, and the data produced in our lab will be named “MMG lab data” (for Marseille Medical Genetics). The reference data were recovered from the GEO website under the accession number GEO: GSE158999. The accessed files are:
barcodes.tsv.gz,
features.tsv.gz,
matrix.mtx.gz,
metadata.tsv.gz.
The Rossi et al. article also refers to and uses the in vivo atlas from Pijuan-Sala, Griffiths, Göttgens et al. (Nature, 2019): in Rossi et al., a classifier was trained on the atlas data. I directly downloaded this classifier from the above-mentioned GitHub repository.
I also downloaded the atlas data of Pijuan-Sala et al. from a terminal with the following commands:
curl https://content.cruk.cam.ac.uk/jmlab/atlas_data.tar.gz > atlas_data.tar.gz
tar -zxvf atlas_data.tar.gz
To reproduce the analyses, it is recommended to set up all these directories. I assume that all the input files are stored in the same directory, inputData. The inputData directory is divided into 5 subdirectories:
2 of them are dedicated to the raw data to be analyzed, from the Marseille Medical Genetics (MMG) laboratory and from the reference datasets (zaffranRawData and rossiRawData, respectively);
1 directory contains the data to create an atlas object from Pijuan-Sala et al. (atlas);
1 directory stores the classifier based on the atlas of Pijuan-Sala et al. and provided by Rossi et al. (scmap_pijuansala_classifier);
1 directory contains the input tables provided by Rossi et al. (InputTables). These tables are unrelated to the GEO files listed above: they are TSV files used, for example, to name the clusters, to keep consistent colors across plots, and to list the genes of interest used in the Rossi et al. analysis.
basePath <- "XXXXX" # to be seen as the root for any analysis based on a set of input data
# input paths
atlas.directory <- paste0(basePath, "inputData/atlas/")
classifier.folder <- paste0(basePath, "inputData/scmap_pijuansala_classifier/")
inputTables.folder <- paste0(basePath, "inputData/InputTables/")
# output paths
baseAnalysis <- paste0(basePath, "XXXXX")
if (!dir.exists(baseAnalysis)) {
dir.create(baseAnalysis)
}
rdsObjects <- paste0(baseAnalysis, "rdsObjects/")
if (!dir.exists(rdsObjects)) {
dir.create(rdsObjects)
}
atlas.folder <- paste0(baseAnalysis, "atlas/")
if (!dir.exists(atlas.folder)) {
dir.create(atlas.folder)
}
fig.folder <- paste0(baseAnalysis, "figures/")
if (!dir.exists(fig.folder)) {
dir.create(fig.folder)
}
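The repeated dir.exists()/dir.create() pattern above can be factored into a small helper (a sketch; the helper name ensure_dir is mine):

```r
# Create a directory if it does not exist yet, and return its path
ensure_dir <- function(path) {
    if (!dir.exists(path)) {
        dir.create(path, recursive = TRUE)
    }
    invisible(path)
}
# e.g. rdsObjects <- ensure_dir(paste0(baseAnalysis, "rdsObjects/"))
```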
To keep the colors consistent across plots throughout the analyses, I set up named vectors of colors.
colors.table <- read.table(file = paste0(inputTables.folder, "ClusterColors.tsv"), sep = "\t", header = TRUE,
comment.char = "", as.is = TRUE)
colors.use_ab.initio <- setNames(colors.table$blind_friendly[!is.na(colors.table$ab_initio_identity)], colors.table$ab_initio_identity[!is.na(colors.table$ab_initio_identity)])
colors.use_transferred <- setNames(colors.table$blind_friendly[!is.na(colors.table$transferred_identity)],
colors.table$transferred_identity[!is.na(colors.table$transferred_identity)])
colors.use_stages <- setNames(c("#be465c", "#b04b3d", "#c86633", "#d0a63f", "#998138", "#90b03d", "#60a756",
"#45c097", "#5e8bd5", "#6d71d8", "#573585", "#bd80d5", "#b853a2", "#ba4b7d"), c("Day_04", "Day_05", "Day_06",
"Day_07", "Day_10", "E6.5", "E6.75", "E7.0", "E7.25", "E7.5", "E7.75", "E8.0", "E8.25", "E8.5"))
Some functions can work with parallelization. Here, I configure how parallelization should be performed.
# Parallelization
plan(strategy = "multisession", workers = 4)
# plan(batchtools_slurm, template =
# '/shared/projects/mothard_in_silico_modeling/slurm_scripts/aa_seuratAn.tmpl')
options(future.globals.maxSize = 62914560000) #60GB
options(future.seed = 26)
The Read10X function only needs the directory in which the input files are stored. It creates an object by interpreting the barcodes, features and matrix files.
The CreateSeuratObject function initializes a Seurat object from this count matrix.
ref.dataset.folder <- paste0(basePath, "inputData/ref_data/")
ref.raw.data <- Read10X(data.dir = ref.dataset.folder, gene = 1)
ref.4d.SO <- CreateSeuratObject(counts = ref.raw.data, project = "recoverAnalysis") #, assay='RNA', min.cells=3, min.features=100)
# Dimensions of the object
head(ref.4d.SO)
dim(ref.4d.SO)
## An object of class Seurat
## 1 features across 30496 samples within 1 assay
## Active assay: RNA (1 features)
## [1] 23961 30496
The reference dataset contains 30,496 cells, in which 23,961 features (genes) were detected.
The metadata file was created during the Rossi et al. analysis and is provided to guide the users. I add the stage and Replicate metadata to the Seurat object under the column names day and replicate, respectively.
ref.md <- read.table(file = paste0(ref.dataset.folder, "metadata.tsv.gz"), sep = "\t", header = TRUE, stringsAsFactors = F)
ref.4d.SO <- AddMetaData(object = ref.4d.SO, metadata = ref.md$stage, col.name = "day")
ref.4d.SO <- AddMetaData(object = ref.4d.SO, metadata = ref.md$Replicate, col.name = "replicate")
Here is all the metadata information contained in the Seurat object of the reference dataset: str(ref.4d.SO@meta.data)
## 'data.frame': 30496 obs. of 5 variables:
## $ orig.ident : Factor w/ 1 level "GAS": 1 1 1 1 1 1 1 1 1 1 ...
## $ nCount_RNA : num 53183 41938 43631 69730 64691 ...
## $ nFeature_RNA: int 6676 5225 5869 7090 7039 5568 6208 5684 3450 6052 ...
## $ day : chr "Day4" "Day4" "Day4" "Day4" ...
## $ replicate : int 0 0 0 0 0 0 0 0 0 0 ...
There are 5 metadata columns stored in a data frame, each of them describing the cells: the two imported from the external file (day and replicate), orig.ident (the identity given at the creation of the Seurat object), and the nCount_RNA and nFeature_RNA information:
nCount_RNA is the number of reads counted in every droplet containing a gel bead and associated with a cell;
nFeature_RNA is the number of distinct features (genes) detected in every droplet.
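Both metadata can be recomputed directly from the raw count matrix; a quick consistency check (a sketch; note that in Seurat v5 the GetAssayData argument is layer rather than slot):

```r
# recompute nCount_RNA and nFeature_RNA from the counts matrix
counts <- GetAssayData(ref.4d.SO, assay = "RNA", slot = "counts")
all(colSums(counts) == ref.4d.SO$nCount_RNA)        # total counts per droplet
all(colSums(counts > 0) == ref.4d.SO$nFeature_RNA)  # detected genes per droplet
```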
We can get the different values a metadata can take by the following command. Here is an example for the metadata named day: unique(ref.4d.SO@meta.data$day)
## [1] "Day4" "Day5" "Day6" "Day7"
For the sake of the analysis, I need to manipulate the labels (day, dataset) to make them coherent between the reference data, the MMG lab data and the atlas data. In the reference dataset, the day metadata was loaded in the previous part; however, I edit it so that the labels sort nicely (5 before 11, and not the reverse).
# modify day label: 'Day4' -> 'Day_04' (zero-pad single-digit days)
ref.4d.SO@meta.data$day <- gsub("^(Day)([0-9]*)$", "\\1_0\\2", ref.4d.SO@meta.data$day)
# create new metadata named dataset
ref.4d.SO$dataset <- paste0("ref_", tolower(ref.4d.SO@meta.data$day))
Idents(ref.4d.SO) <- ref.4d.SO$dataset
ref_dataset_names <- unique(ref.4d.SO$dataset)
A new metadata has been created. Its values are among ref_day_04, ref_day_05, ref_day_06 and ref_day_07, depending on the origin and the stage (day) of the sequenced cells.
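The zero-padding matters because character labels sort lexicographically; a quick illustration:

```r
sort(c("Day5", "Day11", "Day4"))      # "Day11" "Day4"  "Day5"   : wrong order
sort(c("Day_05", "Day_11", "Day_04")) # "Day_04" "Day_05" "Day_11": expected order
```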
Now, here’s what the metadata looks like:
|  | orig.ident | nCount_RNA | nFeature_RNA | day | replicate | dataset |
|---|---|---|---|---|---|---|
| GAS_Day4_batch4_AAACCCATCGACTCCT | GAS | 53183 | 6676 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAACGAAGTCATAACC | GAS | 41938 | 5225 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAACGCTCAAGCGAGT | GAS | 43631 | 5869 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAACGCTCACCCTAAA | GAS | 69730 | 7090 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAAGAACAGCCTTGAT | GAS | 64691 | 7039 | Day_04 | 0 | ref_day_04 |
| GAS_Day4_batch4_AAAGAACAGTTTCGAC | GAS | 39653 | 5568 | Day_04 | 0 | ref_day_04 |
For the following parts of the analysis, I will split the dataset ref.4d.SO into 4 sub-datasets. I use the SplitObject function from the Seurat package (PROVIDE REFERENCE). By providing an attribute (a metadata column name), the object is split into a list of subsetted objects.
I split the dataset according to the day the gastruloids were sequenced, presumably because the sequencing was performed separately for each day (to be checked against the protocol); the days are therefore the natural entities to process. Before the sub-datasets are merged, it is recommended to perform a quality control on each of them (see next part).
# Split the object
refSO.list <- SplitObject(ref.4d.SO, split.by = "dataset")
# set up the project name of each sub-dataset
for (SO in refSO.list) {
projName <- unique(SO@meta.data$dataset)
refSO.list[[projName]]@project.name <- projName
}
Size of the entire dataset:
| Dataset | Nbr of features | Nbr of cells |
|---|---|---|
| ref_4_days | 23961 | 30496 |
Sizes of the sub-datasets:
|  | Dataset | Nbr of features | Nbr of cells |
|---|---|---|---|
| ref_day_04 | ref_day_04 | 23961 | 6898 |
| ref_day_05 | ref_day_05 | 23961 | 7783 |
| ref_day_06 | ref_day_06 | 23961 | 8313 |
| ref_day_07 | ref_day_07 | 23961 | 7502 |
The MMG lab acquired the scRNA-seq data day by day. The sequencing saturation was not good enough for day 5; as a consequence, this day was resequenced. There are hence five datasets: one for each of the days 4, 6 and 10, and two for day 5.
For each dataset, I will execute the following steps:
load and interpret the raw data (barcodes, features and matrix files) with the Read10X function and create a Seurat object (CreateSeuratObject function),
add metadata.
There are five sub-datasets to load. I create a Seurat object for each of them and store them in a list. The project name of each sub-dataset is set at creation.
# get path to the folders of each sub-dataset
lab_dirs <- list.files(paste0(basePath, "inputData/lab_data"), full.names = TRUE)
print(lab_dirs)
# create a list of seurat objects
labSO.list <- list()
lab_dataset_names <- c()
for (x in lab_dirs) {
dataset.name <- str_extract(x, "[a-z0-9_]*$")
lab_dataset_names <- append(lab_dataset_names, dataset.name)
raw.data <- Read10X(data.dir = x, gene = 2)
SO <- CreateSeuratObject(counts = raw.data, project = dataset.name) #, assay='RNA', min.cells=3, min.features=100)
labSO.list[[dataset.name]] <- SO
}
labSO.list
## [1] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_04"
## [2] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_05"
## [3] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_05bis"
## [4] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_06"
## [5] "/shared/projects/mothard_in_silico_modeling/seurat_analysis/inputData/lab_data/lab_day_11"
## $lab_day_04
## An object of class Seurat
## 32286 features across 7324 samples within 1 assay
## Active assay: RNA (32286 features)
##
## $lab_day_05
## An object of class Seurat
## 32286 features across 7794 samples within 1 assay
## Active assay: RNA (32286 features)
##
## $lab_day_05bis
## An object of class Seurat
## 32286 features across 7729 samples within 1 assay
## Active assay: RNA (32286 features)
##
## $lab_day_06
## An object of class Seurat
## 32286 features across 6628 samples within 1 assay
## Active assay: RNA (32286 features)
##
## $lab_day_11
## An object of class Seurat
## 32287 features across 9584 samples within 1 assay
## Active assay: RNA (32287 features)
Similarly to the project names of the reference sub-datasets, the project names of the MMG lab sub-datasets take their values among: lab_day_04, lab_day_05, lab_day_05bis, lab_day_06, lab_day_11.
As for the reference dataset, I need to add metadata to the MMG lab dataset: the day the gastruloids were sequenced, the dataset name (built in the same way as the reference dataset names) and the replicate.
The replicate metadata is required because two datasets were sequenced on day 5; it will be useful to differentiate the cells of these two datasets.
labSO.list <- lapply(labSO.list, function(SO) {
dataset <- SO@project.name
day <- str_to_title(str_extract(dataset, "day_[0-9]*$"))
SO <- AddMetaData(object = SO, metadata = day, col.name = "day")
SO <- AddMetaData(object = SO, metadata = dataset, col.name = "dataset")
if (dataset == "lab_day_05bis") {
SO <- AddMetaData(SO, metadata = 1, col.name = "replicate")
} else {
SO <- AddMetaData(SO, metadata = 0, col.name = "replicate")
}
})
data.frame(head(labSO.list[[1]]@meta.data)) %>%
knitr::kable(align = "lrrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
|  | orig.ident | nCount_RNA | nFeature_RNA | day | dataset | replicate |
|---|---|---|---|---|---|---|
| AAACCCAAGTACAGAT | lab_day_04 | 59952 | 6798 | Day_04 | lab_day_04 | 0 |
| AAACCCAAGTTCATCG | lab_day_04 | 38805 | 5932 | Day_04 | lab_day_04 | 0 |
| AAACCCACAAGCGAAC | lab_day_04 | 45520 | 6008 | Day_04 | lab_day_04 | 0 |
| AAACCCACAAGGTTGG | lab_day_04 | 26327 | 4497 | Day_04 | lab_day_04 | 0 |
| AAACCCACACATATGC | lab_day_04 | 28763 | 4954 | Day_04 | lab_day_04 | 0 |
| AAACCCACACTAAACC | lab_day_04 | 27481 | 5004 | Day_04 | lab_day_04 | 0 |
The names of the cells follow a different nomenclature in the reference and in the MMG lab sub-datasets. On one hand, the reference cell names are already custom names composed of the biological origin (GAS for gastruloids), the day the gastruloids were sequenced, the batch number and the UMI. On the other hand, the MMG lab cell names only consist of the UMIs.
The barcode library used by 10x Genomics provides a limited number of unique UMI barcodes (PROVIDE DETAILS). Hence, working with data obtained from multiple sequencing experiments might lead to an overlap of UMI barcodes between datasets.
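To see why such overlaps are expected, here is a back-of-the-envelope calculation, assuming cell barcodes are drawn uniformly from a shared whitelist on the order of the 10x “737K” list (an assumption; the actual whitelist depends on the chemistry version):

```r
# expected number of identical barcodes between two independent runs
whitelist.size <- 737280   # hypothetical whitelist size
cells.run1 <- 7000         # roughly the size of one sub-dataset here
cells.run2 <- 7000
cells.run1 * cells.run2 / whitelist.size  # ~66 expected collisions
```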
I hence decided to create a nomenclature to rename all the cells with unique identifiers. Each cell will be identified by:
ref or lab, whether the data come from Rossi et al. or the MMG lab,
the dataset value previously added to the metadata,
the replicate number, also present in the metadata, and
the UMI provided by 10x Genomics.
rename_cells <- function(SO) {
cell_ids <- colnames(SO)
UMIs <- str_extract(cell_ids, "[A-Z]*$")
cellnames <- paste(SO$dataset, SO$replicate, UMIs, sep = "_")
SO <- RenameCells(SO, new.names = cellnames)
}
refSO.list <- lapply(refSO.list, rename_cells)
labSO.list <- lapply(labSO.list, rename_cells)
print(colnames(refSO.list$ref_day_04)[1])
print(colnames(labSO.list$lab_day_04)[1])
## [1] "ref_day_04_0_AAACCCATCGACTCCT"
## [1] "lab_day_04_0_AAACCCAAGTACAGAT"
We can check that there are as many unique cell identifiers as there are cells among all the sub-datasets.
nbCells <- Reduce("+", lapply(refSO.list, function(SO) dim(SO)[2])) + Reduce("+", lapply(labSO.list, function(SO) dim(SO)[2]))
nbIDs <- Reduce("+", lapply(refSO.list, function(SO) length(unique(colnames(SO))))) + Reduce("+", lapply(labSO.list,
function(SO) length(unique(colnames(SO)))))
print(paste("Nbr of cells among all the sub-datasets:", nbCells))
print(paste("Nbr of cell identifiers:", nbIDs))
## [1] "Nbr of cells among all the sub-datasets: 69555"
## [1] "Nbr of cell identifiers: 69555"
Now, all the sub-datasets, from both the reference and the MMG lab data, are prepared for the analysis. First, I will control the quality of the cells; after that, I will apply the standard Seurat preprocessing workflow. These preprocessing steps are required before removing the doublets with the DoubletFinder library (PROVIDE SOURCE). Finally, I will investigate whether there are sources of unwanted variation among the replicates.
First and foremost, I gather all the data into a list. The computational advantage lies in the use of lapply(), which applies a function to all the members of a list in one command.
All the sub-datasets (reference and MMG lab data) will be analyzed with the same pipeline.
allSO.list <- do.call(c, list(refSO.list, labSO.list))
# allSO.list <- append(refSO.list, labSO.list) # Equivalent allSO.list <- c(refSO.list, labSO.list) #
# Equivalent
SO.names <- names(allSO.list)
rm(refSO.list, labSO.list)
gc(verbose = FALSE)
str(allSO.list, max.level = 2)
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6380552 340.8 11687185 624.2 8825061 471.4
## Vcells 990866865 7559.8 1475152559 11254.6 1466330759 11187.3
## List of 9
## $ ref_day_04 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ ref_day_05 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ ref_day_06 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ ref_day_07 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_04 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_05 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_05bis:Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_06 :Formal class 'Seurat' [package "Seurat"] with 12 slots
## $ lab_day_11 :Formal class 'Seurat' [package "Seurat"] with 12 slots
All the sub-dataset Seurat objects are now stored in a named list, as shown by the str() output above: the names of the list are the project.name of each sub-dataset.
As the number of cells may decrease during the following steps, I will print a table with the number of cells and the number of detected features in each sub-dataset. A barplot will also be displayed to follow the evolution of each sub-dataset along the analysis.
Size of all the sub-datasets:
size_suball <- data.frame(cbind(sapply(allSO.list, function(SO) {
if (str_detect(SO@project.name, "^ref")) {
"ref"
} else {
"lab"
}
}), sapply(allSO.list, function(SO) {
if (str_detect(SO@project.name, "04")) {
"4"
} else if (str_detect(SO@project.name, "05bis")) {
"5bis"
} else if (str_detect(SO@project.name, "05")) {
"5"
} else if (str_detect(SO@project.name, "06")) {
"6"
} else if (str_detect(SO@project.name, "07")) {
"7"
} else if (str_detect(SO@project.name, "11")) {
"11"
}
}), t(sapply(allSO.list, dim)), "Preliminary Counts", "01"))
colnames(size_suball) <- c("origin", "day", "Nbr_of_features", "Nbr_of_cells", "Analysis_step_name", "Step_number")
size_suball$Nbr_of_cells <- as.integer(as.character(size_suball$Nbr_of_cells))
size_suball$Nbr_of_features <- as.integer(as.character(size_suball$Nbr_of_features))
size_suball$day <- factor(size_suball$day, levels = c("4", "5", "5bis", "6", "7", "11"), ordered = TRUE)
size_suball %>%
knitr::kable(align = "lrrrrlr") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
|  | origin | day | Nbr_of_features | Nbr_of_cells | Analysis_step_name | Step_number |
|---|---|---|---|---|---|---|
| ref_day_04 | ref | 4 | 23961 | 6898 | Preliminary Counts | 01 |
| ref_day_05 | ref | 5 | 23961 | 7783 | Preliminary Counts | 01 |
| ref_day_06 | ref | 6 | 23961 | 8313 | Preliminary Counts | 01 |
| ref_day_07 | ref | 7 | 23961 | 7502 | Preliminary Counts | 01 |
| lab_day_04 | lab | 4 | 32286 | 7324 | Preliminary Counts | 01 |
| lab_day_05 | lab | 5 | 32286 | 7794 | Preliminary Counts | 01 |
| lab_day_05bis | lab | 5bis | 32286 | 7729 | Preliminary Counts | 01 |
| lab_day_06 | lab | 6 | 32286 | 6628 | Preliminary Counts | 01 |
| lab_day_11 | lab | 11 | 32287 | 9584 | Preliminary Counts | 01 |
ggplot(size_suball, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = origin, fill = origin), position = position_dodge(),
width = 0.7) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) + ggtitle("Number of cells in each sub-dataset") +
facet_grid(. ~ origin, scales = "free_x", space = "free")
The cells can be filtered out based on multiple criteria, such as:
the number of features detected in a cell transcriptome,
the number of counts for each single cell or,
the percentage of mitochondrial counts.
The violin plots of the nCount_RNA and percent.mt metadata will guide me in choosing the thresholds to apply for filtering out the low-quality cells.
To determine the mitochondrial percentage in each cell, I rely on the fact that mitochondrial gene names always begin with “mt-”. The PercentageFeatureSet function computes, for each cell, the percentage of counts belonging to the features matching a pattern; here, I use the “^mt-” pattern.
allSO.list <- future_lapply(allSO.list, function(SO) {
SO[["percent.mt"]] <- PercentageFeatureSet(SO, pattern = "^mt-")
return(SO)
}, future.seed = 26)
The percentage of mitochondrial counts (or reads) is stored for each cell as a metadata under the name of percent.mt (see the example on the data of the lab, day 4).
|  | orig.ident | nCount_RNA | nFeature_RNA | day | dataset | replicate | percent.mt |
|---|---|---|---|---|---|---|---|
| lab_day_04_0_AAACCCAAGTACAGAT | lab_day_04 | 59952 | 6798 | Day_04 | lab_day_04 | 0 | 2.340206 |
| lab_day_04_0_AAACCCAAGTTCATCG | lab_day_04 | 38805 | 5932 | Day_04 | lab_day_04 | 0 | 1.154490 |
| lab_day_04_0_AAACCCACAAGCGAAC | lab_day_04 | 45520 | 6008 | Day_04 | lab_day_04 | 0 | 1.001758 |
| lab_day_04_0_AAACCCACAAGGTTGG | lab_day_04 | 26327 | 4497 | Day_04 | lab_day_04 | 0 | 1.762449 |
| lab_day_04_0_AAACCCACACATATGC | lab_day_04 | 28763 | 4954 | Day_04 | lab_day_04 | 0 | 2.718771 |
| lab_day_04_0_AAACCCACACTAAACC | lab_day_04 | 27481 | 5004 | Day_04 | lab_day_04 | 0 | 3.114879 |
lapply(allSO.list, function(SO) {
VlnPlot(SO, features = c("percent.mt"), ncol = 3) + geom_hline(aes(yintercept = 10), color = "blue", size = 1.5) +
geom_hline(aes(yintercept = 1.5), color = "orange", size = 1.5)
})
The reference sub-datasets show a mitochondrial content that never exceeds 15% of the transcripts in a cell. In addition, their cells never have less than 1.5% of mitochondrial content, which is why there is no orange line on the first 4 plots (the plots of the reference sub-datasets). The published data were probably already filtered: the GitHub code of Rossi et al. does contain quality filters, but they have no effect on the published data, which appear to be already stripped of their low-quality cells. In contrast, some cells in the MMG lab data have a mitochondrial content close to 80%.
I will keep the cells whose percentage of mitochondrial genes is between 1.5% and 10% (orange and blue lines, respectively).
Information coming from 10X Genomics
Apoptotic cells express mitochondrial genes and export these transcripts to the cytoplasm in mammalian cells. Lysed cells with intact mitochondria may be partitioned into GEMs, increasing the fraction of mitochondrial transcripts detected.
It is also mentioned in Luecken MD and Theis FJ Mol Syst Biol. 2019 that a high mitochondrial content may represent doublets.
Rossi et al. filtered out the cells with a number of read counts lower than 2000; here, I place the lower threshold at 3000 read counts (green line). In addition to this threshold, I will also remove the cells with a number of read counts higher than 150,000 (purple line).
lapply(allSO.list, function(SO) {
VlnPlot(SO, features = c("nCount_RNA"), ncol = 3) + geom_hline(aes(yintercept = 150000), color = "purple",
size = 1.5) + geom_hline(aes(yintercept = 3000), color = "green", size = 1.5)
})
The cells of the reference sub-datasets never have less than 3000 read counts, which is why the green line does not appear on the first 4 plots (the plots of the reference sub-datasets; no point lies below this threshold). Actually, it appears that they never have less than 5000 read counts.
Furthermore, the higher threshold filters out some outliers in both the reference and the MMG lab sub-datasets. The two MMG lab sub-datasets of day 5 never reach such a number of read counts, so the purple line does not appear on their plots.
The number of feature per cell is also assessed for quality control.
lapply(allSO.list, function(SO) {
VlnPlot(SO, features = c("nFeature_RNA"), ncol = 3) + geom_hline(aes(yintercept = 1800), color = "#56B4E9",
size = 1.5)
})
The reference sub-datasets never have less than 2000 features per cell, while the MMG lab sub-datasets range from 38 to 10064 features per cell. Many cells of the MMG lab sub-datasets show a low number of detected features. Thus, a minimum of 1800 detected features per cell will be required (blue line).
Among the thousands of detected features, some are expressed in only a few cells (on the order of 10-15 cells out of datasets of ~7000 cells). Such features provide very little information on cell heterogeneity and could as well reflect contamination. It is also important to notice that filtering out the features expressed in less than 50 cells (red dashed line) would make it difficult to detect clusters gathering less than 50 cells.
# Create dataframe
countcells.df.list <- lapply(allSO.list, function(SO) {
df <- data.frame(rowSums(SO@assays$RNA@counts != 0))
df$features <- rownames(df)
colnames(df) <- c("Nbr_of_cells", "features")
rownames(df) <- NULL
return(df)
})
lapply(1:length(countcells.df.list), function(i) {
project.name <- names(countcells.df.list)[i]
df <- countcells.df.list[[i]]
# Plot gene expression repartition
ggplot(df, aes(x = Nbr_of_cells)) + geom_histogram() + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) +
ylab("frequency") + scale_x_continuous(trans = "log10") + expand_limits(x = c(0, 10500), y = c(0,
2000)) + geom_vline(aes(xintercept = 8), color = "green", size = 1) + geom_vline(aes(xintercept = 50),
color = "red", size = 1, linetype = "dashed") + ggtitle(paste0("Repartition of expressed features\n",
project.name))
})
In order to minimize the number of dropouts, while still being able to identify small clusters, the features expressed in less than 8 cells will be removed (green line).
Considering all the quality-control criteria altogether, I confirm the previously chosen thresholds. The low-quality cells will be removed according to these thresholds.
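For readability, the retained cutoffs can be gathered in a single object (a sketch; the variable name is mine, the values are those chosen in the sections above):

```r
qc.thresholds <- list(
    percent.mt.min    = 1.5,     # minimum mitochondrial percentage
    percent.mt.max    = 10,      # maximum mitochondrial percentage
    nCount.min        = 3000,    # minimum read counts per cell
    nCount.max        = 150000,  # maximum read counts per cell
    nFeature.min      = 1800,    # minimum detected features per cell
    feature.min.cells = 8        # minimum cells expressing a feature
)
```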
# lapply(allSO.list, function(SO) { print(names(SO@meta.data)) FeatureScatter(object=SO,
# feature1='nCount_RNA', feature2='nFeature_RNA', pt.size=0.8, cols='percent.mt') + geom_point(data =
# data.frame(SO@meta.data$nCount_RNA, SO@meta.data$nFeature_RNA, SO@meta.data$percent.mt), aes(x =
# 'nCount_RNA', y = 'nFeature_RNA'), color='percent.mt') + geom_hline(aes(yintercept = 1500),
# color='#56B4E9', size=1) + geom_vline(aes(xintercept = 150000), color='purple', size=1) +
# geom_vline(aes(xintercept = 3000), color='green', size=1) + ggtitle(paste0('Nb of features as function
# of Nb of counts\n', SO@project.name)) + scale_color_continuous(type = 'viridis') })
lapply(allSO.list, function(SO) {
ggplot(SO@meta.data, aes(nCount_RNA, nFeature_RNA, colour = percent.mt)) + geom_point() + lims(colour = c(0,
100)) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + ylab("nFeature_RNA") + geom_hline(aes(yintercept = 1800),
color = "#56B4E9", size = 1) + geom_vline(aes(xintercept = 150000), color = "purple", size = 1) +
geom_vline(aes(xintercept = 3000), color = "green", size = 1) + ggtitle(paste0("Nb of features as function of Nb of counts\n",
SO@project.name))
})
The cells with a high content of mitochondrial genes lie under the blue line (minimum of 1800 features) and to the left of the green line (minimum of 3,000 read counts). Such cells, with a low count depth (counts per cell), few detected features and a high mitochondrial content, are indicative of cells whose cytoplasmic mRNA has leaked out through a broken membrane, so that only the mRNA located in the mitochondria is conserved (Luecken MD and Theis FJ, Mol Syst Biol, 2019).
In this section, I apply all the cutoffs determined in the previous steps. As a result, the sub-datasets will have either less features or less cells.
allSO.list <- lapply(1:length(allSO.list), function(i) {
df <- countcells.df.list[[i]]
SO <- allSO.list[[i]]
SO <- subset(SO, features = which(df$Nbr_of_cells > 8), subset = percent.mt > 1.5 & percent.mt < 10 &
nCount_RNA > 3000 & nCount_RNA < 150000 & nFeature_RNA > 1800)
return(SO)
})
names(allSO.list) <- c(ref_dataset_names, lab_dataset_names)
Size of all the sub-datasets after filtering:
size_suball2 <- data.frame(cbind(
sapply(allSO.list, function(SO){
if (str_detect(SO@project.name, "^ref")){ "ref" } else { "lab" }
}),
sapply(allSO.list, function(SO){
if (str_detect(SO@project.name, "04")) { "4" }
else if (str_detect(SO@project.name, "05bis")) { "5bis" }
else if (str_detect(SO@project.name, "05")) { "5" }
else if (str_detect(SO@project.name, "06")) { "6" }
else if (str_detect(SO@project.name, "07")) { "7" }
else if (str_detect(SO@project.name, "11")) { "11" }
}),
t(sapply(allSO.list, dim)),
"QC_filtering",
"02"
))
colnames(size_suball2) <- c('origin', 'day', 'Nbr_of_features', 'Nbr_of_cells', 'Analysis_step_name', 'Step_number')
size_suball2$Nbr_of_cells <- as.integer(as.character(size_suball2$Nbr_of_cells))
size_suball2$Nbr_of_features <- as.integer(as.character(size_suball2$Nbr_of_features))
size_suball2$day <- factor(size_suball2$day, levels = c('4', '5', '5bis', '6', '7', '11'), ordered = TRUE)
size_suball_track <- rbind(size_suball, size_suball2)
size_suball_track %>%
knitr::kable(align = "lrrrrlr") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width = "100%", height = "200px")
rm(size_suball, size_suball2)
|  | origin | day | Nbr_of_features | Nbr_of_cells | Analysis_step_name | Step_number |
|---|---|---|---|---|---|---|
| ref_day_04 | ref | 4 | 23961 | 6898 | Preliminary Counts | 01 |
| ref_day_05 | ref | 5 | 23961 | 7783 | Preliminary Counts | 01 |
| ref_day_06 | ref | 6 | 23961 | 8313 | Preliminary Counts | 01 |
| ref_day_07 | ref | 7 | 23961 | 7502 | Preliminary Counts | 01 |
| lab_day_04 | lab | 4 | 32286 | 7324 | Preliminary Counts | 01 |
| lab_day_05 | lab | 5 | 32286 | 7794 | Preliminary Counts | 01 |
| lab_day_05bis | lab | 5bis | 32286 | 7729 | Preliminary Counts | 01 |
| lab_day_06 | lab | 6 | 32286 | 6628 | Preliminary Counts | 01 |
| lab_day_11 | lab | 11 | 32287 | 9584 | Preliminary Counts | 01 |
| ref_day_041 | ref | 4 | 17928 | 6726 | QC_filtering | 02 |
| ref_day_051 | ref | 5 | 18282 | 7637 | QC_filtering | 02 |
| ref_day_061 | ref | 6 | 18247 | 8133 | QC_filtering | 02 |
| ref_day_071 | ref | 7 | 18279 | 7385 | QC_filtering | 02 |
| lab_day_041 | lab | 4 | 17679 | 5518 | QC_filtering | 02 |
| lab_day_051 | lab | 5 | 18786 | 6728 | QC_filtering | 02 |
| lab_day_05bis1 | lab | 5bis | 18582 | 6689 | QC_filtering | 02 |
| lab_day_061 | lab | 6 | 17927 | 5196 | QC_filtering | 02 |
| lab_day_111 | lab | 11 | 18354 | 6398 | QC_filtering | 02 |
ggplot(size_suball_track, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = Analysis_step_name, fill = Analysis_step_name),
position = position_dodge(), width = 0.6) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Number of cells in each sub-dataset\nafter QC") + facet_grid(. ~ origin, scales = "free_x", space = "free")
Note that the reference data from Rossi et al. had already been pre-processed by the original authors before being deposited on GEO, which explains why the QC filtering removes proportionally fewer cells from the reference sub-datasets than from the MMG lab ones.
To compare gene expression across multiple cells, the data have to be normalized. I use the default parameters of the NormalizeData function, i.e. the LogNormalize method: the feature counts of each cell are divided by the total counts of that cell and multiplied by the scale.factor (10,000 by default); the result is then natural-log transformed using log1p.
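As a sanity check of the formula, the LogNormalize transformation can be reproduced in a few lines of base R (the toy count matrix below is hypothetical, not taken from the actual data):

```r
# LogNormalize by hand: divide each cell's counts by that cell's total,
# multiply by the scale factor (10,000 by default), then apply log1p
counts <- matrix(c(5, 0, 12, 3, 0, 7), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("cell1", "cell2")))
scale.factor <- 1e4
normalized <- log1p(sweep(counts, 2, colSums(counts), "/") * scale.factor)
normalized["geneA", "cell1"]  # = log1p(5 / 17 * 1e4)
```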
allSO.list <- future_lapply(allSO.list, NormalizeData, verbose = FALSE, future.seed = 26)
The normalization is calculated from the raw counts stored in the slot named counts. The result is then stored in another slot named data.
data.frame(GetAssayData(allSO.list$ref_day_04, slot = "counts")[1:3, 1:5]) %>%
knitr::kable(caption = "Raw counts in 'counts' slot") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width = "800px", height = "200px")
data.frame(GetAssayData(allSO.list$ref_day_04, slot = "data")[1:3, 1:5]) %>%
knitr::kable(caption = "Normalized counts in 'data' slot") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width = "800px", height = "200px")
| | ref_day_04_0_AAACCCATCGACTCCT | ref_day_04_0_AAACGAAGTCATAACC | ref_day_04_0_AAACGCTCAAGCGAGT | ref_day_04_0_AAACGCTCACCCTAAA | ref_day_04_0_AAAGAACAGCCTTGAT |
|---|---|---|---|---|---|
| Xkr4 | 0 | 0 | 0 | 0 | 0 |
| Rp1 | 0 | 0 | 0 | 0 | 0 |
| Sox17 | 5 | 0 | 0 | 0 | 0 |
| | ref_day_04_0_AAACCCATCGACTCCT | ref_day_04_0_AAACGAAGTCATAACC | ref_day_04_0_AAACGCTCAAGCGAGT | ref_day_04_0_AAACGCTCACCCTAAA | ref_day_04_0_AAAGAACAGCCTTGAT |
|---|---|---|---|---|---|
| Xkr4 | 0.0000000 | 0 | 0 | 0 | 0 |
| Rp1 | 0.0000000 | 0 | 0 | 0 | 0 |
| Sox17 | 0.6628109 | 0 | 0 | 0 | 0 |
I continue with the standard preprocessing steps, using default parameters:
highly variable features (HVG) detection, with the function FindVariableFeatures , where 2000 features are identified as having the most cell-to-cell variation;
scale the data, via a linear transformation, using the ScaleData function, on all features (more details below);
perform linear dimension reduction with Principal Component Analysis (PCA) to identify the sources of heterogeneity in every sub-dataset, computing 50 principal components (PCs);
plot an ElbowPlot to visualize the most informative PCs.
allSO.list <- future_lapply(allSO.list, function(SO) {
SO <- FindVariableFeatures(SO, nfeatures = 2000, verbose = FALSE)
SO <- ScaleData(SO, features = rownames(SO), verbose = FALSE)
SO <- RunPCA(SO, verbose = FALSE)
return(SO)
}, future.seed = 26)
The ScaleData function performs a linear transformation (so-called scaling). In detail, it shifts the expression of each gene so that its mean across cells is 0 (negative values arise), and scales it so that its variance equals 1. Concretely, it standardizes the expression of each gene. This step gives equal weight to each gene in downstream analyses, so that highly-expressed genes do not dominate over the others (<https://satijalab.org/seurat/articles/pbmc3k_tutorial.html>).
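What ScaleData does per gene can be illustrated with a small base-R sketch (toy values, not taken from the actual data):

```r
# standardize one gene's expression across 5 cells:
# center at mean 0, scale to standard deviation 1
expr <- c(0.66, 0, 0, 0, 0)
scaled <- (expr - mean(expr)) / sd(expr)
round(mean(scaled), 10)  # 0
round(sd(scaled), 10)    # 1
```

After this transformation, cells that do not express the gene all share the same slightly negative value, which is what the scale.data table below shows for Xkr4 and Rp1.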
From the scaling step, a new slot named scale.data is added to the Seurat object (a slot is a component of the Seurat object data structure). Below is an example of the same features in the same cells as above.
data.frame(GetAssayData(allSO.list$ref_day_04, slot = "scale.data")[1:3, 1:5]) %>%
knitr::kable(caption = "Scaled data in 'scale.data' slot") %>%
kable_styling(bootstrap_options = c("striped", "hover")) %>%
scroll_box(width = "800px", height = "200px")
| | ref_day_04_0_AAACCCATCGACTCCT | ref_day_04_0_AAACGAAGTCATAACC | ref_day_04_0_AAACGCTCAAGCGAGT | ref_day_04_0_AAACGCTCACCCTAAA | ref_day_04_0_AAAGAACAGCCTTGAT |
|---|---|---|---|---|---|
| Xkr4 | -0.2033839 | -0.2033839 | -0.2033839 | -0.2033839 | -0.2033839 |
| Rp1 | -0.0341485 | -0.0341485 | -0.0341485 | -0.0341485 | -0.0341485 |
| Sox17 | 1.7263776 | -0.2826476 | -0.2826476 | -0.2826476 | -0.2826476 |
The PCA was performed with the identified HVG as input. The Elbow plots below show, for each sub-dataset, how informative each PC is in terms of percentage of variance explained.
In the analysis done by Rossi et al., the PCA computed 30 PCs only, and all 30 were used in the downstream analysis. Here, I use the default parameters of the RunPCA function, which computes 50 PCs.
lapply(allSO.list, function(SO) {
ElbowPlot(SO, ndims = 50) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5, size = 20)) +
geom_vline(aes(xintercept = 30), color = "purple", size = 1) + ggtitle(paste0(SO@project.name, "\nElbowPlot"))
})
Taking into account the advice from the Seurat - Guided Clustering Tutorial vignette, the downstream analysis will be done on the first 30 PCs for all the sub-datasets.
Then, I go through the community detection among the cells. The edges are constructed using a k-nearest neighbors (KNN) graph, based on the euclidean distances previously obtained in the PCA space (using the first 30 PCs). The weight of an edge is based on the shared overlap of the local neighborhoods of its two cells, a method called Jaccard similarity. These 2 steps are done by the FindNeighbors function (https://satijalab.org/seurat/articles/pbmc3k_tutorial.html).
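The Jaccard similarity used to weight the edges can be illustrated on two toy neighbor sets (the indices are hypothetical cell IDs):

```r
# Jaccard similarity between the neighbor sets of two cells:
# size of the intersection divided by the size of the union
neighbors_cellA <- c(1, 2, 3, 4, 5)
neighbors_cellB <- c(3, 4, 5, 6, 7)
jaccard <- length(intersect(neighbors_cellA, neighbors_cellB)) /
  length(union(neighbors_cellA, neighbors_cellB))
jaccard  # 3 shared neighbors / 7 distinct neighbors = 3/7
```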
Once I have obtained a graph on the data, I run the FindClusters function. It applies the Louvain algorithm (by default). This algorithm optimizes the modularity of the graph.
Originally, the analysis done by Rossi et al. set the resolution parameter at 4.6. Here, I test the Louvain algorithm according to multiple levels of resolution (0.1, 0.5, 1, 2 and 5).
res <- c(0.1, 0.5, 1, 2, 5)
allSO.list <- future_lapply(allSO.list, function(SO) {
SO <- FindNeighbors(SO, dims = 1:30, verbose = FALSE)
SO <- FindClusters(SO, resolution = res, random.seed = 11, verbose = FALSE)
return(SO)
}, future.seed = 26)
The relationship between the resolution’s levels and the number of clusters is presented below.
data <- c()
resolution <- c()
nb_clusters <- c()
for (SO in allSO.list) {
for (ares in res) {
nb_levels <- nlevels(SO@meta.data[[paste0("RNA_snn_res.", ares)]])
data <- append(data, SO@project.name)
resolution <- append(resolution, ares)
nb_clusters <- append(nb_clusters, nb_levels)
}
}
df <- data.frame(data, resolution, nb_clusters)
df %>%
pivot_wider(names_from = "data", values_from = "nb_clusters") %>%
knitr::kable(caption = "Number of clusters in sub-datasets\naccording to the resolution level") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| resolution | ref_day_04 | ref_day_05 | ref_day_06 | ref_day_07 | lab_day_04 | lab_day_05 | lab_day_05bis | lab_day_06 | lab_day_11 |
|---|---|---|---|---|---|---|---|---|---|
| 0.1 | 5 | 5 | 6 | 7 | 3 | 4 | 4 | 5 | 10 |
| 0.5 | 10 | 10 | 11 | 14 | 9 | 8 | 8 | 14 | 15 |
| 1.0 | 13 | 14 | 17 | 18 | 15 | 15 | 12 | 18 | 24 |
| 2.0 | 25 | 23 | 24 | 27 | 24 | 22 | 22 | 29 | 32 |
| 5.0 | 44 | 43 | 44 | 46 | 47 | 46 | 44 | 48 | 48 |
By running a non-linear dimension reduction such as the Uniform Manifold Approximation and Projection (UMAP), the clusters can be visualized.
allSO.list <- lapply(allSO.list, RunUMAP, dims = 1:30, verbose = FALSE)
for (ares in res) {
print(DimPlot(allSO.list$lab_day_11, group.by = paste0("RNA_snn_res.", ares), label = TRUE, reduction = "umap") +
ggtitle(paste0(allSO.list$lab_day_11@project.name, "\nResolution level: ", ares)) + theme(plot.title = element_text(hjust = 0.5)))
}
As an example, for the sub-dataset day_11 of the MMG lab, the best value for the resolution parameter appears to be 1; I apply the same choice to all sub-datasets. I set the Idents of all sub-datasets to the clusters defined at resolution level 1 (Idents is the default cell identity used by Seurat, e.g. in plots: with label = TRUE, DimPlot displays the cluster numbers obtained at resolution 1, as already introduced earlier).
allSO.list <- lapply(allSO.list, function(SO) {
Idents(SO) <- SO$RNA_snn_res.1
print(DimPlot(SO, label = TRUE, reduction = "umap") + ggtitle(paste0(SO@project.name, "\nResolution level: 1")) +
theme(plot.title = element_text(hjust = 0.5)))
return(SO)
})
gc()
In single-cell RNA sequencing (scRNAseq), the cells in suspension are captured in droplets together with barcoded microparticles (Macosko et al., Cell, 2015). In most cases, a droplet contains one microparticle and one cell, but sometimes two or more cells are captured within the same droplet, resulting in a doublet or multiplet.
Doublets are considered technical artifacts that have to be filtered out. In this part, I use the R package DoubletFinder (McGinnis et al., Cell Systems, 2019).
This tool requires the following steps to be completed:
estimation of the homotypic doublet proportion (used below to derive the number of heterotypic doublets to remove);
pK parameter optimization.
Two types of doublets have been identified:
homotypic doublets, that derive from transcriptionally similar cells,
heterotypic doublets, that derived from transcriptionally distinct cells.
In fact, DoubletFinder is able to detect heterotypic doublets, but not homotypic ones (from the DoubletFinder GitHub repository).
To identify heterotypic doublets, DoubletFinder relies on a cell annotation. I therefore annotate the cells of the MMG lab dataset in the same way as in the Rossi et al. pipeline: I directly load the classifier provided by Rossi et al., built on 1165 gene markers. This classifier was generated by Rossi et al. from the data of the Pijuan-Sala et al. atlas.
scmap_classifier <- readRDS(file = paste0(classifier.folder, "scmap_classifier_1000markers.rds"))
ref_stages <- c("E6.5", "E6.75", "E7.0", "E7.25", "E7.5", "E7.75", "E8.0", "E8.25", "E8.5")
I use the scmap package from Kiselev, V., Yiu, A. & Hemberg, M., Nat Methods 2018.
allSO.list <- lapply(allSO.list, function(SO) {
set.seed(1234567)
# create the corresponding SingleCellExperiment object
sce <- as.SingleCellExperiment(x = SO)
rowData(sce)$feature_symbol <- rownames(sce)
counts(sce) <- as.matrix(counts(sce))
logcounts(sce) <- as.matrix(logcounts(sce))
# apply scmapCluster
scmapCluster_results <- scmapCluster(projection = sce, index_list = scmap_classifier[ref_stages], threshold = 0)
# add the celltype from the in vivo atlas to the Seurat object
SO <- AddMetaData(object = SO, metadata = scmapCluster_results$combined_labs, col.name = "celltype_DF")
# save memory, do garbage collection
rm(sce)
gc()
return(SO)
})
lapply(allSO.list, function(SO) {
Idents(SO) <- SO$celltype_DF
DimPlot(SO, pt.size = 0.8, label = T, label.size = 6, repel = T, cols = colors.use_transferred[levels(Idents(SO))]) +
theme_void() + NoLegend() + ggtitle(SO@project.name) + theme(plot.title = element_text(hjust = 0.5,
size = 20))
})
As mentioned earlier, DoubletFinder only detects heterotypic doublets. By estimating the proportion of homotypic doublets, I can correct the expected number of doublets and prevent DoubletFinder from overestimating the number of cells to remove.
homo_prop <- sapply(allSO.list, function(SO) {
# based on the annotation (cell types from the Pijuan-Sala classifier); the Louvain
# clusters determined during the preprocessing step could be used instead
annotations <- SO@meta.data$celltype_DF
homotypic.prop <- modelHomotypic(annotations)
return(homotypic.prop)
})
Now, I have an estimate of the homotypic doublet proportion in each of the sub-datasets.
| | Homotypic doublets proportion |
|---|---|
| ref_day_04 | 0.1828770 |
| ref_day_05 | 0.1280660 |
| ref_day_06 | 0.2150622 |
| ref_day_07 | 0.1463046 |
| lab_day_04 | 0.1394656 |
| lab_day_05 | 0.1995816 |
| lab_day_05bis | 0.1997156 |
| lab_day_06 | 0.2937258 |
| lab_day_11 | 0.1429649 |
The percentage of doublets depends on the number of cells loaded into the device before sequencing, and on the prevalence of cell aggregates within the cell suspension.
In the Rossi et al. analysis, the overall doublet rate was estimated at 7.6%, based on the 10x Genomics user guide. Here, according to the user guide relative to our sequencing run, and given that around 20,000 cells were loaded for each experiment, I use a rate of 8.0% for the MMG lab data.
So, I apply this percentage of multiplets to the number of cells in each of the sub-datasets. Then, to estimate the number of heterotypic doublets, I multiply the result by the heterotypic proportion (1 - homotypic doublet proportion).
dblts_nonhomo <- sapply(1:length(allSO.list), function(i) {
# based on the annotation (cell types from the Pijuan-Sala classifier); the Louvain
# clusters determined during the preprocessing step could be used instead
SO <- allSO.list[[i]]
homotypic.prop <- homo_prop[[i]]
if (any(grepl("ref", SO@project.name))) {
nDoublets <- round(ncol(SO) * 7.6/100)
} else {
nDoublets <- round(ncol(SO) * 8/100)
}
nDoublets_nonhomo <- round(nDoublets * (1 - homotypic.prop))
return(nDoublets_nonhomo)
})
names(dblts_nonhomo) <- SO.names
Below is the estimated number of heterotypic doublets to remove.
| | Heterotypic estimated doublets |
|---|---|
| ref_day_04 | 418 |
| ref_day_05 | 506 |
| ref_day_06 | 485 |
| ref_day_07 | 479 |
| lab_day_04 | 379 |
| lab_day_05 | 431 |
| lab_day_05bis | 428 |
| lab_day_06 | 294 |
| lab_day_11 | 439 |
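As a sanity check, the first entry of this table can be reproduced by hand from the numbers reported above (6726 cells in ref_day_04 after QC, a 7.6% doublet rate, and a homotypic proportion of 0.1828770):

```r
n_cells <- 6726              # ref_day_04 after QC filtering
doublet_rate <- 0.076        # rate used for the reference data
homotypic_prop <- 0.1828770  # from the homotypic proportion table
n_doublets <- round(n_cells * doublet_rate)
n_heterotypic <- round(n_doublets * (1 - homotypic_prop))
n_heterotypic  # 418, matching the table above
```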
DoubletFinder relies on 3 main parameters:
pN, the proportion of artificially generated doublets - 0.25 by default;
pK, the PC neighborhood size used to compute the proportion of artificial nearest neighbors (pANN) - to be determined;
nExp, the number of predicted doublets - obtained from the previous step.
First, I use the default value of the pN parameter (0.25), since the DoubletFinder article showed that pN has little influence on the tool performance.
Second, I previously determined the number of heterotypic doublets to remove for each sub-dataset (see the table above).
Finally, I just have to determine the value of the pK parameter, using the ground-truth-agnostic parameter selection strategy (i.e. BCMVN maximization).
pK.list <- sapply(allSO.list,
function(SO){
sweep.res.list_SO <- paramSweep_v3(SO, PCs = 1:30) # as estimated from the elbow plots
sweep.stats_SO <- summarizeSweep(sweep.res.list_SO, GT = FALSE)
bcmvn_SO <- find.pK(sweep.stats_SO)
# display the BCmvn profile across candidate pK values
print(ggplot(bcmvn_SO, aes(pK, BCmetric, group = 1)) +
geom_point() +
geom_line())
pK <- bcmvn_SO %>% # select the pK that maximizes BCmvn to optimize doublet detection
filter(BCmetric == max(BCmetric)) %>%
select(pK)
pK <- as.numeric(as.character(pK[[1]]))
return(pK)
})
names(pK.list) <- SO.names
Below is the selected value of the pK parameter for each sub-dataset.
| | pK parameter values |
|---|---|
| ref_day_04 | 0.18 |
| ref_day_05 | 0.16 |
| ref_day_06 | 0.13 |
| ref_day_07 | 0.30 |
| lab_day_04 | 0.18 |
| lab_day_05 | 0.25 |
| lab_day_05bis | 0.29 |
| lab_day_06 | 0.19 |
| lab_day_11 | 0.24 |
With all parameters set, I can run DoubletFinder on the sub-datasets.
allSO.list <- lapply(allSO.list, function(SO) {
SO <- doubletFinder_v3(SO, PCs = 1:30, pN = 0.25, pK = pK.list[[SO@project.name]], nExp = dblts_nonhomo[[SO@project.name]])
})
By running DoubletFinder on the sub-datasets, a new metadata column appears in the Seurat objects. This column is named DF.classifications (suffixed with the pN, pK and nExp values used), and every cell is tagged either as Singlet or Doublet; the distribution of the tags can be inspected with, e.g., `table()` on this column.
Finally, I just have to subset the sub-datasets on the Singlet tagged cells.
allSO.list <- lapply(allSO.list, function(SO) {
# remove the cells identified as doublets of each of the sub-datasets
col_dblts <- grep("DF.classifications", colnames(SO@meta.data), value = TRUE)
Idents(SO) <- col_dblts
SO <- subset(SO, idents = "Singlet")
# remove useless columns
SO@meta.data[[grep("DF.classifications", colnames(SO@meta.data))]] <- NULL
names(SO@meta.data)[names(SO@meta.data) == grep("pANN", colnames(SO@meta.data), value = T)] <- "pANN"
return(SO)
})
Size of all the sub-datasets after doublet removal:
size_suball3 <- data.frame(cbind(
sapply(allSO.list, function(SO){
if (str_detect(SO@project.name, "^ref")){ "ref" } else { "lab" }
}),
sapply(allSO.list, function(SO){
if (str_detect(SO@project.name, "04")) { "4" }
else if (str_detect(SO@project.name, "05bis")) { "5bis" }
else if (str_detect(SO@project.name, "05")) { "5" }
else if (str_detect(SO@project.name, "06")) { "6" }
else if (str_detect(SO@project.name, "07")) { "7" }
else if (str_detect(SO@project.name, "11")) { "11" }
}),
t(sapply(allSO.list, dim)),
"Removed Doublets",
"03"
))
colnames(size_suball3) <- c('origin', 'day', 'Nbr_of_features', 'Nbr_of_cells', 'Analysis_step_name', 'Step_number')
size_suball3$Nbr_of_cells <- as.integer(as.character(size_suball3$Nbr_of_cells))
size_suball3$Nbr_of_features <- as.integer(as.character(size_suball3$Nbr_of_features))
size_suball3$day <- factor(size_suball3$day, levels = c('4', '5', '5bis', '6', '7', '11'), ordered = TRUE)
size_suball_track <- rbind(size_suball_track, size_suball3)
knitr::kable(size_suball_track)
rm(size_suball3)
| | origin | day | Nbr_of_features | Nbr_of_cells | Analysis_step_name | Step_number |
|---|---|---|---|---|---|---|
| ref_day_04 | ref | 4 | 23961 | 6898 | Preliminary Counts | 01 |
| ref_day_05 | ref | 5 | 23961 | 7783 | Preliminary Counts | 01 |
| ref_day_06 | ref | 6 | 23961 | 8313 | Preliminary Counts | 01 |
| ref_day_07 | ref | 7 | 23961 | 7502 | Preliminary Counts | 01 |
| lab_day_04 | lab | 4 | 32286 | 7324 | Preliminary Counts | 01 |
| lab_day_05 | lab | 5 | 32286 | 7794 | Preliminary Counts | 01 |
| lab_day_05bis | lab | 5bis | 32286 | 7729 | Preliminary Counts | 01 |
| lab_day_06 | lab | 6 | 32286 | 6628 | Preliminary Counts | 01 |
| lab_day_11 | lab | 11 | 32287 | 9584 | Preliminary Counts | 01 |
| ref_day_041 | ref | 4 | 17928 | 6726 | QC_filtering | 02 |
| ref_day_051 | ref | 5 | 18282 | 7637 | QC_filtering | 02 |
| ref_day_061 | ref | 6 | 18247 | 8133 | QC_filtering | 02 |
| ref_day_071 | ref | 7 | 18279 | 7385 | QC_filtering | 02 |
| lab_day_041 | lab | 4 | 17679 | 5518 | QC_filtering | 02 |
| lab_day_051 | lab | 5 | 18786 | 6728 | QC_filtering | 02 |
| lab_day_05bis1 | lab | 5bis | 18582 | 6689 | QC_filtering | 02 |
| lab_day_061 | lab | 6 | 17927 | 5196 | QC_filtering | 02 |
| lab_day_111 | lab | 11 | 18354 | 6398 | QC_filtering | 02 |
| ref_day_042 | ref | 4 | 17928 | 6308 | Removed Doublets | 03 |
| ref_day_052 | ref | 5 | 18282 | 7131 | Removed Doublets | 03 |
| ref_day_062 | ref | 6 | 18247 | 7648 | Removed Doublets | 03 |
| ref_day_072 | ref | 7 | 18279 | 6906 | Removed Doublets | 03 |
| lab_day_042 | lab | 4 | 17679 | 5139 | Removed Doublets | 03 |
| lab_day_052 | lab | 5 | 18786 | 6297 | Removed Doublets | 03 |
| lab_day_05bis2 | lab | 5bis | 18582 | 6261 | Removed Doublets | 03 |
| lab_day_062 | lab | 6 | 17927 | 4902 | Removed Doublets | 03 |
| lab_day_112 | lab | 11 | 18354 | 5959 | Removed Doublets | 03 |
ggplot(size_suball_track, aes(x = day, y = Nbr_of_cells)) + geom_col(aes(color = Analysis_step_name, fill = Analysis_step_name),
position = position_dodge(), width = 0.6) + theme_classic() + theme(plot.title = element_text(hjust = 0.5)) +
ggtitle("Number of cells in each sub-dataset\nafter doublets removal") + facet_grid(. ~ origin)
Because doublet removal discarded cells (just as the QC filtering did), the variable features, scaling, PCA, neighbor graph and clusters computed earlier are no longer up to date. I therefore run again the complete Seurat standard preprocessing workflow, with the same parameters as before. See more details in section 7.
res <- seq(0.8, 1.6, 0.2)
allSO.list <- lapply(allSO.list, function(SO) {
SO <- NormalizeData(SO, verbose = FALSE)
SO <- FindVariableFeatures(SO, nfeatures = 2000, verbose = FALSE)
SO <- ScaleData(SO, features = rownames(SO), verbose = FALSE)
SO <- RunPCA(SO, verbose = FALSE)
ElbowPlot(SO, ndims = 50) + theme_minimal() + theme(plot.title = element_text(hjust = 0.5)) + geom_vline(aes(xintercept = 30),
color = "purple", size = 1) + ggtitle(paste0("ElbowPlot\n", SO@project.name))
SO <- FindNeighbors(SO, dims = 1:30, verbose = FALSE)
SO <- FindClusters(SO, resolution = res, random.seed = 11, verbose = FALSE)
SO <- RunUMAP(SO, dims = 1:30)
for (ares in res) {
print(DimPlot(SO, group.by = paste0("RNA_snn_res.", ares), label = TRUE, reduction = "umap") + ggtitle(paste0("Resolution level: ",
ares, "\n", SO@project.name)) + theme(plot.title = element_text(hjust = 0.5)))
}
return(SO)
})